Backend Python Tutorial#
Welcome to the backend tutorial on how to use DQMaRC programmatically.
We begin by importing the necessary libraries, including pandas and the DataQuality class. In this case, we will also import some functions from the UtilitiesDQMaRC module, including MetricCalculator, BarPlotGenerator, and DonutChartGenerator. These are supplementary functions for quick visualisation of the data quality markup report. The overall_quality_fx() function calculates the overall quality.
Source Data#
Import Libraries#
import pandas as pd
from pkg_resources import resource_filename
# Import DQMaRC
from DQMaRC import DataQuality
# Import UtilitiesDQMaRC for data quality visualisation
from DQMaRC.UtilitiesDQMaRC import (
MetricCalculator,
BarPlotGenerator,
DonutChartGenerator,
overall_quality_fx, col_good, col_bad)
DQMaRC: the main package to assess multiple data quality dimensions for a given dataset.
UtilitiesDQMaRC: a suite of modules to support the initial visualisation of the data quality markup report produced by DQMaRC. These utilities are also used to generate the shiny frontend interface.
- MetricCalculator: calculates sum totals and averages of data quality errors for each dimension.
- BarPlotGenerator: accepts the summary results produced by MetricCalculator to generate a plotly barplot for quick visualisation of data quality averages.
- DonutChartGenerator: accepts the same summary results to generate plotly donut charts of the average data quality per dimension.
- overall_quality_fx: a function to generate an overall average of all data quality metrics and fields.
- col_good/col_bad: colour encodings for good and poor data quality percentages.
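For orientation, these utilities are typically chained as follows (a preview sketch only; the full, runnable version appears in the Data Quality Visualisation section below):
# calculator = MetricCalculator(raw_results_subset)  # subset of DataQuality.raw_results()
# calculator.calculate_metrics()
# DonutChartGenerator(calculator.result).plot_donut_charts()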
Data Import#

We will use a synthetic dataset generated by Mockaroo and read it in as a pandas DataFrame. This data is included in the package but is purely synthetic.
# Read in example data as pandas dataframe
#df = pd.read_csv('../DQMaRC/data/toydf_subset.csv')
df = pd.read_csv(resource_filename('DQMaRC', 'data/toydf_subset.csv'))
df
Patient_ID | Gender | Date_of_Diagnosis | Date_of_Birth | Age | Weight_kg | Height_cm | BMI | Postcode | ICD_10_Code | Tumour_Stage | Metastatic_Indicator | Tumour_M_Stage | Regimen_Start_Date | Datetime_Event1 | Datetime_Logging1 | Datetime_Event2 | Datetime_Logging2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Male | 23/04/2033 | 07/06/2090 | 72.0 | NaN | 220.50 | 0.0 | BS5 0NQ | Y743 | Incomplete Stage | Present | M0 | 04/10/2025 | 13/03/2024 00:00 | 13/03/2024 00:07 | 16/03/2024 00:00 | 16/03/2024 00:53 |
1 | 1 | Male | NaN | 07/06/2090 | 181.0 | 882.01 | 184.53 | 68.0 | BT78 3PN | Y743 | Incomplete Stage2 | Present | M0 | 04/10/2025 | 25/03/2024 00:00 | 25/03/2024 00:02 | 19/03/2024 00:00 | 19/03/2024 01:11 |
2 | 2 | . | 06/02/2020 | NaN | 140.0 | 414.60 | 170.05 | 67.0 | UB7 0JP | Unkown | Incomplete Stage2 | Present | M0 | 24/08/2021 | 21/03/2024 00:00 | 21/03/2024 00:01 | 04/03/2024 00:00 | 04/03/2024 00:18 |
3 | 3 | Male | 30/07/2021 | 24/08/2003 | 151.0 | 703.49 | 30.99 | 39.0 | MK9 3NZ | | Incomplete Stage2 | Present | M0 | NaN | 12/03/2024 00:00 | 12/03/2024 00:08 | 30/03/2024 00:00 | 30/03/2024 01:28 |
4 | 4 | Male | 17/09/2010 | 13/07/1999 | NaN | 804.32 | 7.48 | 55.0 | B63 3QX | | Incomplete Stage2 | Present | M0 | 24/07/2015 | 20/03/2024 00:00 | 20/03/2024 00:07 | 29/03/2024 00:00 | 29/03/2024 00:47 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
104 | 104 | Female | 26/11/2010 | 01/02/1979 | 122.0 | 34.63 | 159.07 | 98.0 | YO10 3JQ | O3691X1 | Stage 2 | Present | M0 | 10/12/2018 | 1991/03/2024 00:00:00 | 23/03/2024 00:00 | 13/03/2024 00:00 | 01-01-1122 01:25:00 |
105 | 105 | Female | 30/07/2011 | 08/10/1987 | 163.0 | 598.80 | NaN | 7.0 | ME15 6TB | T23129D | Stage3 | Present | M0 | 12/09/2015 | 1991/03/2024 00:00:00 | 23/03/2024 00:00 | 13/03/2024 00:00 | 01-01-1122 01:25:00 |
106 | 106 | Female | 15/11/2014 | 17/10/1925 | 81.0 | 53.69 | 226.65 | 54.0 | L5 2QA | M778 | Stage 4 | Present | M0 | 26/03/2017 | 1991/03/2024 00:00:00 | 23/03/2024 00:00 | 13/03/2024 00:00 | 01-01-1122 01:25:00 |
107 | 107 | Female | 15/03/2018 | 25/10/1974 | 45.0 | 788.70 | 105.50 | 16.0 | N15 4GE | D160 | Stage 3 | Present | M0 | 10/06/2015 | 1991/03/2024 00:00:00 | 23/03/2024 00:00 | 13/03/2024 00:00 | 01-01-1122 01:25:00 |
108 | 108 | Male | 13/10/2017 | 02/06/1948 | 145.0 | 147.52 | 107.68 | 18.0 | KY5 0NE | S66117D | Stage3 | Present | M0 | 01/09/2019 | 1991/03/2024 00:00:00 | 23/03/2024 00:00 | 13/03/2024 00:00 | 01-01-1122 01:25:00 |
109 rows × 18 columns
Test Parameters Setup#
Initialise DQMaRC#
DQMaRC must be initialised by applying the DataQuality class to your source dataset. DQMaRC then requires configuration of the test_params attribute. This is a dataset that DQMaRC uses to map which test parameters (or data quality tests) should be applied to the relevant fields or variables of the source dataset. Each test parameter is functionally supported by a respective method provided by DQMaRC. For example, the Completeness_NULL test parameter is calculated by the test_null() method.
If this is your first time using DQMaRC, you can automatically generate a test_params template using the get_param_template() method. This template is based on the source dataset and automatically activates the Completeness and Uniqueness parameters for all source data fields. In other words, users can run DQMaRC using the auto-generated template to see what the outputs look like and determine whether this is the right tool for the job.
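For example (a sketch; we assume here that get_param_template() is callable on the initialised DataQuality object, mirroring the description above):
# dq = DataQuality(df)
# template = dq.get_param_template()  # assumed method; auto-generates a template from df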
We strongly encourage users to get familiar with the test parameters and to take the time to customise them. All test parameters are listed and explained in a table below. We recommend customising them in Excel due to ease of use, but Python users also have the option to customise test parameters programmatically. We demonstrate how to do this here.
# Initialise a DQ object by passing your data to the tool
dq = DataQuality(df)
# Retrieve the default test parameters from the object. We will edit these in the next step
test_params = dq.get_test_params()
# View the test parameters template
test_params
Warning - Using default parameters. For best results edit parameters and reload using set_test_params() method
Field | Date_Format | Completeness_NULL | Completeness_Empty | Completeness_Encoded | Completeness_Encoded_Mapping | Uniqueness_Rows | Consistency_Compare | Consistency_Compare_Field | Consistency_Compare_Mapping | ... | Validity_NHS_Number | Validity_Postcode_UK | Validity_Lookup_Table | Validity_Lookup_Type | Validity_Lookup_Codes | Validity_Range | Validity_Range_Numeric | Validity_Pattern | Validity_Pattern_Regex | Accuracy_Gold_Standard | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Patient_ID | True | True | False | NaN | True | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | False | NaN | False | |
1 | Gender | True | True | False | NaN | True | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | False | NaN | False | |
2 | Date_of_Diagnosis | True | True | False | NaN | True | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | False | NaN | False | |
3 | Date_of_Birth | True | True | False | NaN | True | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | False | NaN | False | |
4 | Age | True | True | False | NaN | True | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | False | NaN | False | |
5 | Weight_kg | True | True | False | NaN | True | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | False | NaN | False | |
6 | Height_cm | True | True | False | NaN | True | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | False | NaN | False | |
7 | BMI | True | True | False | NaN | True | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | False | NaN | False | |
8 | Postcode | True | True | False | NaN | True | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | False | NaN | False | |
9 | ICD_10_Code | True | True | False | NaN | True | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | False | NaN | False | |
10 | Tumour_Stage | True | True | False | NaN | True | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | False | NaN | False | |
11 | Metastatic_Indicator | True | True | False | NaN | True | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | False | NaN | False | |
12 | Tumour_M_Stage | True | True | False | NaN | True | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | False | NaN | False | |
13 | Regimen_Start_Date | True | True | False | NaN | True | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | False | NaN | False | |
14 | Datetime_Event1 | True | True | False | NaN | True | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | False | NaN | False | |
15 | Datetime_Logging1 | True | True | False | NaN | True | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | False | NaN | False | |
16 | Datetime_Event2 | True | True | False | NaN | True | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | False | NaN | False | |
17 | Datetime_Logging2 | True | True | False | NaN | True | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | False | NaN | False |
18 rows × 30 columns
Test Parameters Configurations#
Now we will demonstrate how to edit the test parameters programmatically. You can also export the test parameters template as a csv file and edit it in MS Excel.
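For example, a minimal export/reimport round trip might look like this (the file name is arbitrary):
# Export the template for editing in Excel, then read the edited copy back in
test_params.to_csv('my_test_params.csv', index=False)
# ... edit in Excel ...
test_params = pd.read_csv('my_test_params.csv')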
Datetime Format#
The first test parameter to edit is Date_Format. Although this is not strictly speaking a data quality metric, it does help Python to correctly identify and calculate other data quality metrics for datetime fields.
# Datetime format
test_params.loc[test_params['Field']=='Date_of_Diagnosis', 'Date_Format'] = "%d/%m/%Y"
test_params.loc[test_params['Field']=='Date_of_Birth', 'Date_Format'] = "%d/%m/%Y"
test_params.loc[test_params['Field']=='Datetime_Event1', 'Date_Format'] = "%d/%m/%Y %H:%M"
test_params.loc[test_params['Field']=='Datetime_Logging1', 'Date_Format'] = "%d/%m/%Y %H:%M"
# Another way to do this, using the row indices shown in the template above:
# test_params.at[2,'Date_Format']="%d/%m/%Y" # Date_of_Diagnosis
# test_params.at[3,'Date_Format']="%d/%m/%Y" # Date_of_Birth
# test_params.at[14,'Date_Format']="%d/%m/%Y %H:%M" # Datetime_Event1
# test_params.at[15,'Date_Format']="%d/%m/%Y %H:%M" # Datetime_Logging1
Completeness Parameters#
The first data quality dimension we will edit is Completeness. When you initialise the test parameters template, DQMaRC assumes that the user may wish to calculate the number of NULL or empty records for all fields. As such, the default values for Completeness_NULL (see the test_null() method) and Completeness_Empty (see the test_empty() method) are TRUE.
We encourage the user to think of or look for other values that may signify missing data. For example, values such as "Unknown", "Not Known", or "na" may indicate data missingness. To capture these, we set the test parameter Completeness_Encoded to TRUE (see the test_na_strings() method) for the field in question (in this case, "Gender"). Then we include the relevant missing codes in Completeness_Encoded_Mapping. Each possible value or string encoding for missing data is separated by a pipe, e.g. 'Unknown|Not Known|na' (i.e. Unknown or Not Known or na).
# Completeness
# Set to true for all fields
test_params['Completeness_NULL'] = True # Default value
test_params['Completeness_Empty'] = True # Default value
# Set Completeness Encoding only for the "Gender" column
test_params.loc[test_params['Field']=='Gender', 'Completeness_Encoded'] = True
test_params.loc[test_params['Field']=='Gender', 'Completeness_Encoded_Mapping'] = 'Unknown|Not Known|na' # use pipes for multiple possible codes for missingness
Uniqueness Parameters#
Uniqueness evaluates whether records are unique across the fields that are set to TRUE. In most cases, we assume that records should be unique across at least all fields combined. Therefore, DQMaRC by default sets the parameter Uniqueness_Rows to TRUE for all fields. However, if records are expected to be unique across select fields only, we encourage the user to set TRUE for just those fields and FALSE for the remaining fields, as sketched after the code below.
# Uniqueness
# Set to true for all fields
test_params['Uniqueness_Rows'] = True # Default value
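As a sketch, if records were expected to be unique on Patient_ID alone (a hypothetical choice for this dataset):
# test_params['Uniqueness_Rows'] = False
# test_params.loc[test_params['Field']=='Patient_ID', 'Uniqueness_Rows'] = True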
Consistency Parameters#
Consistency evaluates whether data between two fields in the same dataset agree as expected. It consists of two key metrics: Consistency_Compare (see the test_one_to_one() method), which compares values between two fields, and Consistency_Date_Relations (see the date_relationships() method), which checks the logical relationship between two datetime fields.
We demonstrate the first consistency metric by comparing Metastatic_Indicator with Tumour_M_Stage. In theory, both variables indicate whether cancer has spread to distant locations, but in the real world they may come from different sources. The values in both fields are recorded differently, but mean similar things. For example, if Metastatic_Indicator == Absent, then we expect Tumour_M_Stage == M0, meaning no metastasis is present. If Metastatic_Indicator == Present, then we expect Tumour_M_Stage to be M1, M1a, or other values, but here we abbreviate the list to keep things simple.
For datetime fields, we may expect one date to occur before or after another in the same dataset. We demonstrate this using Date_of_Birth, which obviously should occur before Date_of_Diagnosis. Importantly, we tell DQMaRC to raise an error if it detects that Date_of_Birth > Date_of_Diagnosis.
# Consistency
# Set up consistency checks between the "Metastatic_Indicator" and "Tumour_M_Stage" columns
test_params.loc[test_params['Field']=='Metastatic_Indicator', 'Consistency_Compare'] = True
test_params.loc[test_params['Field']=='Metastatic_Indicator', 'Consistency_Compare_Field'] = 'Tumour_M_Stage'
test_params.loc[test_params['Field']=='Metastatic_Indicator', 'Consistency_Compare_Mapping'] = '{"Absent": ["M0"], "Present": ["M1", "M1a"]}'
# Set up consistency checks between date fields "Date_of_Birth" and "Date_of_Diagnosis"
test_params.loc[test_params['Field']=='Date_of_Birth', 'Consistency_Date_Relations'] = True
test_params.loc[test_params['Field']=='Date_of_Birth', 'Consistency_Date_Relations_Field'] = 'Date_of_Diagnosis'
test_params.loc[test_params['Field']=='Date_of_Birth', 'Consistency_Date_Relationship'] = '>' # i.e. raise an error if Date of Birth > Date of Diagnosis
Timeliness Parameters#
Timeliness describes how fresh or up-to-date data is. One way to measure this is to define a threshold on the time difference (see the date_diff_calc() method) between fields that record the timing of events and when those events were captured in the system. In this case, the Datetime_Event1 field signifies the time at which a user says an event occurred, or when that data was observed in the real world, whereas Datetime_Logging1 is the electronically generated datetime log indicating when the user entered this data into the system. In a clinical or health setting, the freshness of data may be crucial. For example, in critical care, patient observation data may be recorded several minutes or hours after the time of observation, instead of immediately. To flag these errors, we tell DQMaRC to raise an error if the time difference between Datetime_Event1 and Datetime_Logging1 is greater than 10 minutes.
# Timeliness
# Raise an error if a time difference threshold of 10 minutes is exceeded
test_params.loc[test_params['Field']=='Datetime_Event1', 'Timeliness_Date_Diff'] = True
test_params.loc[test_params['Field']=='Datetime_Event1', 'Timeliness_Date_Diff_Field'] = 'Datetime_Logging1'
test_params.loc[test_params['Field']=='Datetime_Event1', 'Timeliness_Date_Diff_Threshold'] = '10' # i.e. raise an error if timediff >10 minutes
Validity Parameters#
Data Validity describes data that conform to the expected standards, patterns, or ranges that reflect the intended real-world objects. DQMaRC has 7 validity metrics, shown below:
Future Dates: flags records containing a date that occurs in the future (see the Validity_Dates_Future() method).
# (1) Future Dates
test_params.loc[test_params['Field']=='Date_of_Diagnosis', 'Validity_Dates_Future'] = True
test_params.loc[test_params['Field']=='Date_of_Birth', 'Validity_Dates_Future'] = True
Date Outliers: checks if date records fall outside expected date ranges (see the min_max_dates() method).
# (2) Date Outliers
test_params.loc[test_params['Field']=='Date_of_Diagnosis', 'Validity_Date_Range'] = True
test_params.loc[test_params['Field']=='Date_of_Birth', 'Validity_Date_Range'] = True
test_params.loc[test_params['Field']=='Date_of_Diagnosis', 'Validity_Date_Range_Min'] = '2011-01-01'
test_params.loc[test_params['Field']=='Date_of_Birth', 'Validity_Date_Range_Min'] = '1900-01-01'
test_params.loc[test_params['Field']=='Date_of_Diagnosis', 'Validity_Date_Range_Max'] = '2023-07-07'
test_params.loc[test_params['Field']=='Date_of_Birth', 'Validity_Date_Range_Max'] = '2023-01-01'
Numerical Outliers: checks if numerical data fall outside expected numerical ranges (see the test_ranges() method). Note that the permissible range is supplied as a pipe-separated 'min|max' string in the Validity_Range_Numeric column (see the test parameters definitions table below).
# (3) Numerical Outliers
test_params.loc[test_params['Field']=='Age', 'Validity_Range'] = True
test_params.loc[test_params['Field']=='Age', 'Validity_Range_Numeric'] = '0|120'
test_params.loc[test_params['Field']=='Height_cm', 'Validity_Range'] = True
test_params.loc[test_params['Field']=='Height_cm', 'Validity_Range_Numeric'] = '20|210'
NHS Number Validator: checks if NHS numbers are valid (see the validate_nhs_number() method).
# (4) NHS Number Validator
# test_params.loc[test_params['Field']=='NHS_Number', 'Validity_NHS_Number'] = True
UK Postcode Validator: checks if UK postcodes are valid (see the test_postcode() method).
# (5) UK Postcode Validator
test_params.loc[test_params['Field']=='Postcode', 'Validity_Postcode_UK'] = True
Data Standards: checks if values conform to expected permissible values as defined by a data standard (see the test_against_lookup_tables() method). Either create and save your own data standard:
# (6a) Data Standards
test_params.loc[test_params['Field']=='Tumour_M_Stage', 'Validity_Lookup_Table'] = True
test_params.loc[test_params['Field']=='Tumour_M_Stage', 'Validity_Lookup_Type'] = 'Values'
test_params.loc[test_params['Field']=='Tumour_M_Stage', 'Validity_Lookup_Codes'] = 'M0|M1|M1b|pM1'
Or access a pre-defined list saved as a csv file:
# (6b) Use an external csv data standard list of valid codes
lu_filename = '../DQMaRC/data/lookups/LU_toydf_gender.csv'
# Here we apply a pre-defined data standard for "Gender"
test_params.loc[test_params['Field']=='Gender', 'Validity_Lookup_Table'] = True
test_params.loc[test_params['Field']=='Gender', 'Validity_Lookup_Type'] = 'File'
test_params.loc[test_params['Field']=='Gender', 'Validity_Lookup_Codes'] = lu_filename
Regular Expression Pattern: checks if data conform to an expected pattern defined by a regular expression (see the test_pattern_validity() method). A quick way to sanity-check a regex is sketched after this list.
# (7) Regular Expression Pattern
test_params.loc[test_params['Field']=='Datetime_Event1', 'Validity_Pattern'] = True
test_params.loc[test_params['Field']=='Datetime_Event1', 'Validity_Pattern_Regex'] = r"(\d{2})/(\d{2})/(\d{4}) (\d{2}):(\d{2})"
test_params.loc[test_params['Field']=='Datetime_Event2', 'Validity_Pattern'] = True
test_params.loc[test_params['Field']=='Datetime_Event2', 'Validity_Pattern_Regex'] = r"[0-9]{2}/[0-9]{2}/[0-9]{4} [0-9]{2}:[0-9]{2}"
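To sanity-check a regular expression before adding it to the parameters, you can exercise it with Python's re module (a sketch; we use fullmatch here, and DQMaRC's own matching semantics may differ):
import re
pattern = re.compile(r"(\d{2})/(\d{2})/(\d{4}) (\d{2}):(\d{2})")
print(bool(pattern.fullmatch("13/03/2024 00:07")))       # True - expected format
print(bool(pattern.fullmatch("1991/03/2024 00:00:00")))  # False - malformed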
Accuracy Parameters#
Accuracy measures the consistency of input data against a known gold standard. In this case, we will set the gold standard to the input data itself to demonstrate the functionality.
# Accuracy
# Set a manually validated version of the data set as the gold standard
test_params['Accuracy_Gold_Standard'] = True
# Supply the gold standard data - we are using the same dataset here for ease
dq.accuracy.set_gold_standard(df)
Test Parameters Definitions#
Here is a list of all test parameters and their respective descriptions and permissible configurations.
# Read in the test parameter definitions as a pandas dataframe
test_params_definitions = pd.read_csv(resource_filename('DQMaRC', 'data/test_params_definitions.csv'))
test_params_definitions
Test Parameter | Python_Data_Type | Example_Data | Description | |
---|---|---|---|---|
0 | Field | String | Metastatic_Indicator | Name of the fields of the source input data. |
1 | Date_Format | String | %d/%m/%Y %H:%M | The datetime format pattern specified by ISO s... |
2 | Completeness_NULL | Boolean | TRUE or FALSE | Count of "NULL" values. |
3 | Completeness_Empty | Boolean | TRUE or FALSE | Count of empty records. |
4 | Completeness_Encoded | Boolean | TRUE or FALSE | TRUE/FALSE indicator if field uses codes to in... |
5 | Completeness_Encoded_Mapping | String | Missing|N/A|Empty | A list of values or codes that represent missi... |
6 | Uniqueness_Rows | Boolean | TRUE or FALSE | TRUE/FALSE indicator if field is expected to b... |
7 | Consistency_Compare | Boolean | TRUE or FALSE | TRUE/FALSE to indicate if field should be comp... |
8 | Consistency_Compare_Field | String | Tumour_M_Stage | Enter the name of the field to compare against... |
9 | Consistency_Compare_Mapping | Pandas dictionary | {"Present": ["M1", "M1a"]} | Map the permissible correct comparison values ... |
10 | Consistency_Date_Relations | Boolean | TRUE or FALSE | TRUE/FALSE if a date field should be compared ... |
11 | Consistency_Date_Relationship | String | <, >, <=, >= | Relationship of the comparison (<, >, <=, or >... |
12 | Consistency_Date_Relations_Field | String | Date_of_Birth | Name of the compared datetime field if "Consis... |
13 | Timeliness_Date_Diff | Boolean | TRUE or FALSE | Indicator to perform a calculation between two... |
14 | Timeliness_Date_Diff_Field | String | Datetime_Logging1 | If "Timeliness_Date_Diff" = "TRUE", here you s... |
15 | Timeliness_Date_Diff_Threshold | Integer | 10 | What is the permissible threshold (in minutes)... |
16 | Validity_Dates_Future | Boolean | TRUE or FALSE | TRUE/FALSE indicator to raise an error if a fu... |
17 | Validity_Date_Range | Boolean | TRUE or FALSE | TRUE/FALSE if a date range is applicable. |
18 | Validity_Date_Range_Min | String | 01/01/1900 | If "Validity_Dates_Future" = "TRUE", what is t... |
19 | Validity_Date_Range_Max | String | 01/01/2025 | If "Validity_Dates_Future" = "TRUE", what is t... |
20 | Validity_NHS_Number | Boolean | TRUE or FALSE | TRUE/FALSE if the NHS validity algorithm check... |
21 | Validity_Postcode_UK | Boolean | TRUE or FALSE | TRUE/FALSE to raise an error if invalid UK pos... |
22 | Validity_Lookup_Table | Boolean | TRUE or FALSE | TRUE/FALSE if a data standard is applicable. |
23 | Validity_Lookup_Table_Filename | String | LU_toydf_ICD10_v5.csv | If "Validity_Lookup_Table"="TRUE", what is the... |
24 | Validity_Range | Boolean | TRUE or FALSE | TRUE/FALSE if a numerical range applicable, e.... |
25 | Validity_Range_Numeric | String | 0|110 | If "Validity_Range"="TRUE", supply the numeric... |
26 | Validity_Pattern | Boolean | TRUE or FALSE | TRUE/FALSE if a valid pattern is applicable. |
27 | Validity_Pattern_Regex | String | [0-9]{2}/[0-9]{2}/[0-9]{2} | If "Validity_Pattern"="TRUE", supply the permi... |
28 | Accuracy_Gold_Standard | Boolean | TRUE or FALSE | TRUE/FALSE if a gold standard reference datase... |
Upload Custom Test Parameters#
We have shown how you can programmatically edit the test parameters dataframe. However, users may opt to make these edits in Excel instead. In fact, we encourage users to spend the time to do this to maximise the relevance of the data quality output reports. Here we show how you can upload a pre-defined test parameters file.
test_params_upload = pd.read_csv(resource_filename('DQMaRC', 'data/toydf_subset_test_params_24.05.16.csv'))
test_params_upload
Field | Date_Format | Completeness_NULL | Completeness_Empty | Completeness_Encoded | Completeness_Encoded_Mapping | Uniqueness_Rows | Consistency_Compare | Consistency_Compare_Field | Consistency_Compare_Mapping | ... | Validity_NHS_Number | Validity_Postcode_UK | Validity_Lookup_Table | Validity_Lookup_Type | Validity_Lookup_Codes | Validity_Range | Validity_Range_Numeric | Validity_Pattern | Validity_Pattern_Regex | Accuracy_Gold_Standard | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Patient_ID | NaN | True | True | False | NaN | True | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | False | NaN | False |
1 | Gender | NaN | True | True | True | .|unknown|null|not known|na|n/a|none | False | False | NaN | NaN | ... | False | False | True | File | LU_toydf_gender.csv | False | NaN | False | NaN | False |
2 | Date_of_Diagnosis | %d/%m/%Y | True | True | False | NaN | False | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | True | (\d{2})/(\d{2})/(\d{4}) | False |
3 | Date_of_Birth | %d/%m/%Y | True | True | False | NaN | False | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | True | (\d{2})/(\d{2})/(\d{4}) | False |
4 | Age | NaN | True | True | False | NaN | False | False | NaN | NaN | ... | NaN | NaN | False | NaN | NaN | True | 0|110 | False | NaN | False |
5 | Weight_kg | NaN | True | True | False | NaN | False | False | NaN | NaN | ... | NaN | NaN | False | NaN | NaN | True | 5|150 | False | NaN | False |
6 | Height_cm | NaN | True | True | False | NaN | False | False | NaN | NaN | ... | NaN | NaN | False | NaN | NaN | True | 10|200 | False | NaN | False |
7 | BMI | NaN | True | True | False | NaN | False | False | NaN | NaN | ... | NaN | NaN | False | NaN | NaN | True | 5|60 | False | NaN | False |
8 | Postcode | NaN | True | True | True | .|unknown|null|not known|na|n/a|none | False | False | NaN | NaN | ... | False | True | False | NaN | NaN | False | NaN | False | NaN | False |
9 | ICD_10_Code | NaN | True | True | True | .|unknown|Unknown|null|not known|na|n/a|N/A|none | False | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | False | NaN | False |
10 | Tumour_Stage | NaN | True | True | True | None|N/A|Incomplete Stage|Incomplete Stage2 | False | False | NaN | NaN | ... | False | False | True | File | LU_toydf_tumour_stage.csv | False | NaN | False | NaN | False |
11 | Metastatic_Indicator | NaN | True | True | False | NaN | False | True | Tumour_M_Stage | {"Absent": ["M0"], "Present": ["M1", "M1a", "M... | ... | False | False | False | NaN | NaN | False | NaN | False | NaN | False |
12 | Tumour_M_Stage | NaN | True | True | False | NaN | False | False | NaN | NaN | ... | False | False | True | Values | M0|M1|M1b|pM1 | False | NaN | False | NaN | False |
13 | Regimen_Start_Date | %d/%m/%Y | True | True | False | NaN | False | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | True | [0-9]{2}/[0-9]{2}/[0-9]{2} | False |
14 | Datetime_Event1 | %d/%m/%Y %H:%M | True | True | False | NaN | False | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | True | (\d{2})/(\d{2})/(\d{4}) (\d{2}):(\d{2}) | False |
15 | Datetime_Logging1 | %d/%m/%Y %H:%M | True | True | False | NaN | False | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | True | (\d{2})/(\d{2})/(\d{4}) (\d{2}):(\d{2}) | False |
16 | Datetime_Event2 | %d/%m/%Y %H:%M | True | True | False | NaN | False | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | True | [0-9]{2}/[0-9]{2}/[0-9]{4} [0-9]{2}:[0-9]{2} | False |
17 | Datetime_Logging2 | %d/%m/%Y %H:%M | True | True | False | NaN | False | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | True | (\d{2})/(\d{2})/(\d{4}) (\d{2}):(\d{2}) | False |
18 rows × 30 columns
Set Test Parameters#
Now we are ready to set the test parameters. We are using the pre-defined, uploaded test parameters.
dq.set_test_params(test_params_upload)
Using uploaded test parameters
Run DQMaRC#
Once you have set your chosen test parameters, you run DQMaRC using the run_all_metrics() method. Users can optionally run each data quality dimension separately, as indicated in the commented code below.
dq.run_all_metrics()
# To run each dimension separately, use the following methods:
# dq.completeness.run_metrics()
# dq.uniqueness.run_metrics()
# dq.consistency.run_metrics()
# dq.timeliness.run_metrics()
# dq.validity.run_metrics()
# dq.accuracy.run_metrics()
Get DQ Results#
Now you can export your results. The two main DQ reports generated by DQMaRC include the full cell-level markup and the field-wise aggregated results.
Full Results#
The deepest, and potentially most useful, output is the data quality markup returned by DQMaRC.DataQuality.raw_results(). This is a cell-level binary markup of data quality errors, or flags. You can join the output of raw_results() to the original source dataset by the index.
raw = dq.raw_results()
raw
# The full results can be joined to the source data by the index.
# source_df_raw = df.join(raw)
Completeness_NULL_|_Patient_ID | Completeness_NULL_|_Gender | Completeness_NULL_|_Date_of_Diagnosis | Completeness_NULL_|_Date_of_Birth | Completeness_NULL_|_Age | Completeness_NULL_|_Weight_kg | Completeness_NULL_|_Height_cm | Completeness_NULL_|_BMI | Completeness_NULL_|_Postcode | Completeness_NULL_|_ICD_10_Code | ... | validity_count_|_Datetime_Event2 | validity_count_|_Datetime_Logging1 | validity_count_|_Datetime_Logging2 | validity_count_|_Gender | validity_count_|_Height_cm | validity_count_|_Postcode | validity_count_|_Regimen_Start_Date | validity_count_|_Tumour_M_Stage | validity_count_|_Tumour_Stage | validity_count_|_Weight_kg | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
104 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
105 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
106 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
107 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
108 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
109 rows × 114 columns
Aggregated Results#
You can also access a higher-level aggregation of the raw results. This contains the sum count of data quality errors detected for each test parameter and each source data field. To access this, simply use the DQMaRC.DataQuality.aggregate_results() method.
agg = dq.aggregate_results()
agg
metric | field | Completeness_NULL | Completeness_Empty | Completeness_Encoded | completeness_count | row_uniqueness | uniqueness_count | Consistency_Compare | Consistency_Date_Relations | consistency_count | Timeliness_Date_Diff | timeliness_count | Validity_Dates_Future | Validity_Date_Range | Validity_Postcode_UK | Validity_Lookup_Table | Validity_Range | Validity_Pattern | validity_count |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
12 | Patient_ID | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
8 | Gender | 11.0 | 0.0 | 7.0 | 18.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | NaN | NaN | 0.0 |
3 | Date_of_Diagnosis | 2.0 | 0.0 | NaN | 2.0 | NaN | NaN | NaN | 6.0 | 6.0 | NaN | NaN | 1.0 | 79.0 | NaN | NaN | NaN | 0.0 | 80.0 |
2 | Date_of_Birth | 1.0 | 0.0 | NaN | 1.0 | NaN | NaN | NaN | 6.0 | 6.0 | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | 0.0 | 1.0 |
0 | Age | 11.0 | 0.0 | NaN | 11.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 50.0 | NaN | 50.0 |
17 | Weight_kg | 11.0 | 0.0 | NaN | 11.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 84.0 | NaN | 84.0 |
9 | Height_cm | 10.0 | 0.0 | NaN | 10.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 34.0 | NaN | 34.0 |
1 | BMI | 9.0 | 0.0 | NaN | 9.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 36.0 | NaN | 36.0 |
13 | Postcode | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | NaN | NaN | NaN | 2.0 |
10 | ICD_10_Code | 2.0 | 4.0 | 1.0 | 7.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
16 | Tumour_Stage | 11.0 | 0.0 | 10.0 | 21.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 28.0 | NaN | NaN | 28.0 |
11 | Metastatic_Indicator | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN | 49.0 | NaN | 49.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
15 | Tumour_M_Stage | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 7.0 | NaN | NaN | 7.0 |
14 | Regimen_Start_Date | 1.0 | 0.0 | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | NaN | NaN | NaN | NaN | 0.0 | 2.0 |
4 | Datetime_Event1 | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN | NaN | 9.0 | 9.0 | 19.0 | 19.0 | 8.0 | NaN | NaN | NaN | NaN | 9.0 | 17.0 |
6 | Datetime_Logging1 | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN | NaN | 9.0 | 9.0 | NaN | NaN | 8.0 | NaN | NaN | NaN | NaN | 0.0 | 8.0 |
5 | Datetime_Event2 | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN | NaN | 2.0 | 2.0 | 103.0 | 103.0 | 8.0 | NaN | NaN | NaN | NaN | 0.0 | 8.0 |
7 | Datetime_Logging2 | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 8.0 | NaN | NaN | NaN | NaN | 9.0 | 17.0 |
18 | full_row_uniqueness | NaN | NaN | NaN | NaN | 2.0 | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Data Quality Visualisation#
Here we demonstrate examples of data visualisations to gain quick high-level overviews of the DQ results.
Overall data quality#
The overall quality is the average data quality for all metrics across all fields of the source dataset. The applied logic is:
- if average > 90, return "Outstanding"
- if average >= 80, return "Good"
- if average >= 60, return "Requires Improvement"
- if average < 60, return "Inadequate"
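As a rough sketch, the rating logic above could be expressed as follows (illustrative only; the packaged overall_quality_fx additionally returns display colours, as used further below):
def quality_label(average):
    # Map an average data quality percentage to the rating described above
    if average > 90:
        return "Outstanding"
    elif average >= 80:
        return "Good"
    elif average >= 60:
        return "Requires Improvement"
    else:
        return "Inadequate"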
# Prepare DQ Dashboard
raw_subset = raw.filter(regex='completeness|validity|consistency|uniqueness_count|accuracy|timeliness')
calculator = MetricCalculator(raw_subset)
# Run the calculation step to compute the aggregates
calculator.calculate_metrics()
summary_results = calculator.result
summary_results['Colour_Good'] = summary_results.apply(col_good, axis=1)
summary_results['Colour_Bad'] = summary_results.apply(col_bad, axis=1)
summary_results['Colour_NA'] = '#B2C3C6'
# Overall quality label
from IPython.display import HTML
# Function to display the overall quality label in a Jupyter Notebook
def display_overall_quality_label():
    if not summary_results.empty:
        data1 = summary_results[summary_results['Prop_NA'] == 0]
        avg_prop_good = data1['Prop_Good'].mean()
        overall_quality_level, background_color, text_colour = overall_quality_fx(avg_prop_good)
        overall_quality_text = f"Overall Quality: {overall_quality_level}"
        html = f"""
        <div style="
            background-color: {background_color};
            padding: 10px;
            border-radius: 5px;
            color: {text_colour};
            border: 2px solid {text_colour};
            text-align: center;
            width: 300px;">
            {overall_quality_text}
        </div>
        """
        return HTML(html)
    else:
        return HTML("<div style='text-align: center;'>No data available</div>")
# Use the function to display the result
display_overall_quality_label()
Donut Charts#
The donut charts represent the average DQ for all metrics and fields within a given DQ dimension.
DonutChartGenerator(summary_results).plot_donut_charts()
Bar Chart Completeness#
The barcharts represent the average DQ for each field of a given DQ dimension. Here we show all relevant fields’ average completeness quality.
BarPlotGenerator(summary_results, "completeness").plot_bar()
Bar Chart Consistency#
Here we show all relevant fields’ average consistency.
BarPlotGenerator(summary_results, "consistency").plot_bar()
Bar Chart Timeliness#
Here we show all relevant fields’ average timeliness.
BarPlotGenerator(summary_results, "timeliness").plot_bar()
Bar Chart Uniqueness#
Uniqueness indicates if records are unique across the combination of fields chosen.
BarPlotGenerator(summary_results, "uniqueness").plot_bar()
Bar Chart Validity#
Here we show all relevant fields’ average validity.
BarPlotGenerator(summary_results, "validity").plot_bar()
Example Errors by DQ Dimension#
First we must join the source data with the data quality markup, raw. We can do this using the pandas join method. The resulting df_DQ_full dataframe contains both the source data and the data quality markup.
# Join source data to the full DQ markup results
df_DQ_full = df.join(raw, how="left")
Completeness Examples#
In this example, we showcase completeness errors present in the Gender variable of the source dataset. Values such as ., unknown, and NaN are flagged as errors, represented as 1 in the adjacent fields Completeness_NULL_|_Gender, Completeness_Empty_|_Gender, and Completeness_Encoded_|_Gender. The completeness_count_|_Gender variable is the sum of these three DQ metrics for the variable in question (Gender in this case).
gender_completeness_conditions = (df_DQ_full['Completeness_NULL_|_Gender']>0) | \
(df_DQ_full['Completeness_Empty_|_Gender']>0) | \
(df_DQ_full['Completeness_Encoded_|_Gender']>0)
df_DQ_full[['Gender','Completeness_NULL_|_Gender',
'Completeness_Empty_|_Gender', 'Completeness_Encoded_|_Gender',
'completeness_count_|_Gender']].loc[(gender_completeness_conditions)]
Gender | Completeness_NULL_|_Gender | Completeness_Empty_|_Gender | Completeness_Encoded_|_Gender | completeness_count_|_Gender | |
---|---|---|---|---|---|
2 | . | 0 | 0 | 1 | 1 |
7 | . | 0 | 0 | 1 | 1 |
10 | NaN | 1 | 0 | 0 | 1 |
11 | NaN | 1 | 0 | 0 | 1 |
15 | . | 0 | 0 | 1 | 1 |
26 | . | 0 | 0 | 1 | 1 |
27 | NaN | 1 | 0 | 0 | 1 |
31 | unknown | 0 | 0 | 1 | 1 |
32 | unknown | 0 | 0 | 1 | 1 |
33 | unknown | 0 | 0 | 1 | 1 |
37 | NaN | 1 | 0 | 0 | 1 |
46 | NaN | 1 | 0 | 0 | 1 |
55 | NaN | 1 | 0 | 0 | 1 |
57 | NaN | 1 | 0 | 0 | 1 |
69 | NaN | 1 | 0 | 0 | 1 |
82 | NaN | 1 | 0 | 0 | 1 |
91 | NaN | 1 | 0 | 0 | 1 |
101 | NaN | 1 | 0 | 0 | 1 |
Uniqueness Examples#
Uniqueness evaluates duplicate records for variables where we expect them to be unique. Here you can see that the row_uniqueness_|_full_row_uniqueness variable indicates 1 if a duplicate record is present across the Patient_ID and Gender source variables.
# Check which rows have duplication where they should be unique
df_DQ_full[['Patient_ID', 'Gender']][df_DQ_full[['Patient_ID', 'Gender']].duplicated()]
# Show how the uniqueness flags appear
uniqueness_conditions = (df_DQ_full['Patient_ID']==1) | (df_DQ_full['Patient_ID']==10)
df_DQ_full[['Patient_ID', 'Gender',
'row_uniqueness_|_full_row_uniqueness']].loc[(uniqueness_conditions)]
Patient_ID | Gender | row_uniqueness_|_full_row_uniqueness | |
---|---|---|---|
0 | 1 | Male | 1 |
1 | 1 | Male | 0 |
10 | 10 | NaN | 1 |
11 | 10 | NaN | 0 |
Consistency Examples#
We demonstrate consistency errors below by comparing the source variables Metastatic_Indicator and Tumour_M_Stage. In the full DQ report, the variable Consistency_Compare_|_Metastatic_Indicator contains 1 if an inconsistency is detected between Metastatic_Indicator and Tumour_M_Stage, based on the mapping provided in the test_params data.
consistency_conditions_metastatic_indicator = (df_DQ_full['Consistency_Compare_|_Metastatic_Indicator']>0)
df_DQ_full[['Tumour_M_Stage','Metastatic_Indicator',
'Consistency_Compare_|_Metastatic_Indicator',
'consistency_count_|_Metastatic_Indicator']].loc[(consistency_conditions_metastatic_indicator)]
Tumour_M_Stage | Metastatic_Indicator | Consistency_Compare_|_Metastatic_Indicator | consistency_count_|_Metastatic_Indicator | |
---|---|---|---|---|
0 | M0 | Present | 1 | 1 |
1 | M0 | Present | 1 | 1 |
2 | M0 | Present | 1 | 1 |
3 | M0 | Present | 1 | 1 |
4 | M0 | Present | 1 | 1 |
5 | M0 | Present | 1 | 1 |
6 | M0 | Present | 1 | 1 |
7 | M0 | Present | 1 | 1 |
8 | M0 | Present | 1 | 1 |
9 | M0 | Present | 1 | 1 |
10 | M1 | Absent | 1 | 1 |
11 | M1 | Absent | 1 | 1 |
12 | M1 | Absent | 1 | 1 |
13 | M1 | Absent | 1 | 1 |
14 | M1a | Absent | 1 | 1 |
15 | pM1a | Absent | 1 | 1 |
16 | M1a | Absent | 1 | 1 |
17 | M1b | Absent | 1 | 1 |
18 | M1c | Absent | 1 | 1 |
19 | pM1a | Absent | 1 | 1 |
80 | M0 | Present | 1 | 1 |
81 | M0 | Present | 1 | 1 |
82 | M0 | Present | 1 | 1 |
83 | M0 | Present | 1 | 1 |
84 | M0 | Present | 1 | 1 |
85 | M0 | Present | 1 | 1 |
86 | M0 | Present | 1 | 1 |
87 | M0 | Present | 1 | 1 |
88 | M0 | Present | 1 | 1 |
89 | M0 | Present | 1 | 1 |
90 | M0 | Present | 1 | 1 |
91 | M0 | Present | 1 | 1 |
92 | M0 | Present | 1 | 1 |
93 | M0 | Present | 1 | 1 |
94 | M0 | Present | 1 | 1 |
95 | M0 | Present | 1 | 1 |
96 | M0 | Present | 1 | 1 |
97 | M0 | Present | 1 | 1 |
98 | M0 | Present | 1 | 1 |
99 | M0 | Present | 1 | 1 |
100 | M0 | Present | 1 | 1 |
101 | M0 | Present | 1 | 1 |
102 | M0 | Present | 1 | 1 |
103 | M0 | Present | 1 | 1 |
104 | M0 | Present | 1 | 1 |
105 | M0 | Present | 1 | 1 |
106 | M0 | Present | 1 | 1 |
107 | M0 | Present | 1 | 1 |
108 | M0 | Present | 1 | 1 |
We also demonstrate a consistency check between Date_of_Birth and Date_of_Diagnosis.
consistency_conditions_dates = (df_DQ_full['Consistency_Date_Relations_|_Date_of_Diagnosis']>0) | \
(df_DQ_full['Consistency_Date_Relations_|_Date_of_Birth']>0)
df_DQ_full[['Date_of_Birth','Date_of_Diagnosis',
'Consistency_Date_Relations_|_Date_of_Birth',
'Consistency_Date_Relations_|_Date_of_Diagnosis']].loc[(consistency_conditions_dates)]
Date_of_Birth | Date_of_Diagnosis | Consistency_Date_Relations_|_Date_of_Birth | Consistency_Date_Relations_|_Date_of_Diagnosis | |
---|---|---|---|---|
0 | 07/06/2090 | 23/04/2033 | 1 | 1 |
19 | 25/10/2015 | 12/07/2011 | 1 | 1 |
36 | 10/07/2017 | 31/05/2015 | 1 | 1 |
66 | 04/07/2014 | 26/01/2010 | 1 | 1 |
67 | 13/01/2020 | 07/12/2010 | 1 | 1 |
73 | 28/01/2015 | 01/02/2014 | 1 | 1 |
Timeliness Examples#
Here we demonstrate example timeliness errors by comparing the datetime difference between Datetime_Event2 and Datetime_Logging2. In the customised test parameters, we set a threshold of 10 minutes. In other words, if the difference in time between these variables is greater than 10 minutes, an error, i.e. 1, is recorded in the Timeliness_Date_Diff_|_Datetime_Event2 field of the DQ report.
timeliness_conditions = (df_DQ_full['Timeliness_Date_Diff_|_Datetime_Event2']>0)
df_DQ_full[['Datetime_Event2','Datetime_Logging2',
'Timeliness_Date_Diff_|_Datetime_Event2',
'timeliness_count_|_Datetime_Event2']].loc[(timeliness_conditions)]
Datetime_Event2 | Datetime_Logging2 | Timeliness_Date_Diff_|_Datetime_Event2 | timeliness_count_|_Datetime_Event2 | |
---|---|---|---|---|
0 | 16/03/2024 00:00 | 16/03/2024 00:53 | 1 | 1 |
1 | 19/03/2024 00:00 | 19/03/2024 01:11 | 1 | 1 |
2 | 04/03/2024 00:00 | 04/03/2024 00:18 | 1 | 1 |
3 | 30/03/2024 00:00 | 30/03/2024 01:28 | 1 | 1 |
4 | 29/03/2024 00:00 | 29/03/2024 00:47 | 1 | 1 |
... | ... | ... | ... | ... |
104 | 13/03/2024 00:00 | 01-01-1122 01:25:00 | 1 | 1 |
105 | 13/03/2024 00:00 | 01-01-1122 01:25:00 | 1 | 1 |
106 | 13/03/2024 00:00 | 01-01-1122 01:25:00 | 1 | 1 |
107 | 13/03/2024 00:00 | 01-01-1122 01:25:00 | 1 | 1 |
108 | 13/03/2024 00:00 | 01-01-1122 01:25:00 | 1 | 1 |
103 rows × 4 columns
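As a sanity check, the same 10-minute rule can be reproduced with plain pandas (a sketch, independent of DQMaRC's internals; note that rows whose datetimes fail to parse become NaT and are not flagged here, unlike in the DQMaRC markup above):
# Reproduce the 10-minute threshold rule directly on the source data
event = pd.to_datetime(df['Datetime_Event2'], format='%d/%m/%Y %H:%M', errors='coerce')
logged = pd.to_datetime(df['Datetime_Logging2'], format='%d/%m/%Y %H:%M', errors='coerce')
late = (logged - event) > pd.Timedelta(minutes=10)
print(late.sum())  # number of records breaching the threshold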
Validity Examples#
Here we demonstrate example errors of data validity across all 7 validity metrics: future dates, outlier dates, invalid NHS numbers, invalid UK postcodes, invalid codes, numerical outliers, and invalid patterns.
Future Dates#
Future dates are flagged when a datetime field contains a date in the future relative to the current datetime. This check cannot detect a date that was in the future when it was recorded but has since passed; such a record may instead be detected as an outlier date or an inconsistent date.
validity_future_dates_conditions = (df_DQ_full['Validity_Dates_Future_|_Date_of_Diagnosis']>0)
df_DQ_full[['Date_of_Diagnosis','Validity_Dates_Future_|_Date_of_Diagnosis']].loc[(validity_future_dates_conditions)]
Date_of_Diagnosis | Validity_Dates_Future_|_Date_of_Diagnosis | |
---|---|---|
0 | 23/04/2033 | 1 |
Outlier Dates#
Outlier dates are dates that occur outside the ranges as set in the test parameters
.
validity_outlier_dates_conditions = (df_DQ_full['Validity_Date_Range_|_Date_of_Diagnosis']>0)
df_DQ_full[['Date_of_Diagnosis','Validity_Date_Range_|_Date_of_Diagnosis']].loc[(validity_outlier_dates_conditions)]
Date_of_Diagnosis | Validity_Date_Range_|_Date_of_Diagnosis | |
---|---|---|
0 | 23/04/2033 | 1 |
4 | 17/09/2010 | 1 |
6 | 09/01/2013 | 1 |
7 | 19/06/2011 | 1 |
8 | 13/10/2012 | 1 |
... | ... | ... |
104 | 26/11/2010 | 1 |
105 | 30/07/2011 | 1 |
106 | 15/11/2014 | 1 |
107 | 15/03/2018 | 1 |
108 | 13/10/2017 | 1 |
79 rows × 2 columns
Invalid NHS Numbers#
Invalid NHS numbers are flagged when NHS numbers do not meet the requirements of the NHS validation algorithm. Here is how you would check which NHS numbers in your dataset are invalid. We excluded this from our example to minimise the risk of sharing potentially real NHS numbers, even when synthetically generated.
#validity_NHS_number_conditions = (df_DQ_full['Validity_NHS_Number_|_NHS_number']>0)
#df_DQ_full[['NHS_number','Validity_NHS_Number_|_NHS_number']].loc[(validity_NHS_number_conditions)]
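For reference, the standard NHS number check is a Modulus 11 checksum. The sketch below illustrates that public algorithm; it is independent of DQMaRC's own validate_nhs_number() implementation:
def nhs_number_is_valid(nhs_number: str) -> bool:
    digits = [int(c) for c in nhs_number if c.isdigit()]
    if len(digits) != 10:
        return False
    # Weight the first nine digits from 10 down to 2 and sum the products
    total = sum(d * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    if check == 10:
        # 10 can never be a valid check digit, so the number is invalid
        return False
    return check == digits[9]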
Invalid UK Postcodes#
Invalid UK postcodes can be detected by the UK postcode validation algorithm. Other countries’ postcodes can be validated using pattern validation as shown below.
validity_UK_postcodes_conditions = (df_DQ_full['Validity_Postcode_UK_|_Postcode']>0)
df_DQ_full[['Postcode','Validity_Postcode_UK_|_Postcode']].loc[(validity_UK_postcodes_conditions)]
Postcode | Validity_Postcode_UK_|_Postcode | |
---|---|---|
6 | AAA 1AA | 1 |
8 | BB1 1ABB | 1 |
Invalid Tumour Stage Codes#
This is one example of a validation test applied to a categorical or coded source variable. In this case, we validate Tumour_Stage codes by comparing them against a list of valid codes. We must provide the code list as a csv file and give its name in the test parameters dataset.
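If you need to build such a code list yourself, writing it to csv can be as simple as the sketch below (the column layout DQMaRC expects is an assumption here; check a bundled lookup such as LU_toydf_gender.csv for the actual format):
# Hypothetical code list; the 'code' column name is an assumption
valid_stages = pd.DataFrame({'code': ['Stage 1', 'Stage 2', 'Stage 3', 'Stage 4']})
valid_stages.to_csv('LU_my_tumour_stage.csv', index=False)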
validity_codes_conditions = (df_DQ_full['Validity_Lookup_Table_|_Tumour_Stage']>0)
df_DQ_full[['Tumour_Stage','Validity_Lookup_Table_|_Tumour_Stage']].loc[(validity_codes_conditions)]
Tumour_Stage | Validity_Lookup_Table_|_Tumour_Stage | |
---|---|---|
23 | Stage3 | 1 |
27 | Stage3 | 1 |
30 | Stage3 | 1 |
31 | stage4 | 1 |
34 | Stage3 | 1 |
38 | stage 4 | 1 |
42 | Stage3 | 1 |
45 | Stage3 | 1 |
46 | stage4 | 1 |
49 | Stage3 | 1 |
53 | stage 4 | 1 |
57 | Stage3 | 1 |
60 | Stage3 | 1 |
61 | stage4 | 1 |
64 | Stage3 | 1 |
71 | Stage3 | 1 |
74 | Stage3 | 1 |
75 | stage4 | 1 |
78 | Stage3 | 1 |
82 | stage 4 | 1 |
86 | Stage3 | 1 |
90 | Stage3 | 1 |
93 | Stage3 | 1 |
94 | stage4 | 1 |
97 | Stage3 | 1 |
101 | stage 4 | 1 |
105 | Stage3 | 1 |
108 | Stage3 | 1 |
Numerical Outliers#
Numerical outliers are detected by applying a minimum and maximum numerical range in the test parameters.
validity_numerical_ranges_conditions = (df_DQ_full['Validity_Range_|_Height_cm']>0)
df_DQ_full[['Height_cm','Validity_Range_|_Height_cm']].loc[(validity_numerical_ranges_conditions)]
Height_cm | Validity_Range_|_Height_cm | |
---|---|---|
0 | 220.50 | 1 |
4 | 7.48 | 1 |
8 | 296.97 | 1 |
10 | 210.99 | 1 |
12 | 289.45 | 1 |
16 | 202.02 | 1 |
17 | 227.60 | 1 |
19 | 248.42 | 1 |
21 | 297.19 | 1 |
23 | 249.22 | 1 |
25 | 275.20 | 1 |
26 | 265.48 | 1 |
30 | 217.05 | 1 |
32 | 252.69 | 1 |
33 | 281.95 | 1 |
35 | 283.26 | 1 |
39 | 230.48 | 1 |
44 | 247.59 | 1 |
47 | 253.83 | 1 |
49 | 217.83 | 1 |
52 | 211.40 | 1 |
57 | 240.35 | 1 |
62 | 246.27 | 1 |
71 | 210.46 | 1 |
72 | 255.92 | 1 |
74 | 292.97 | 1 |
79 | 6.35 | 1 |
80 | 256.97 | 1 |
82 | 209.31 | 1 |
84 | 299.91 | 1 |
85 | 231.40 | 1 |
92 | 226.65 | 1 |
99 | 226.65 | 1 |
106 | 226.65 | 1 |
Invalid Patterns#
The pattern validation check uses a regular expression pattern. Below, you can see a datetime variable with invalid date formats or patterns. The test parameters are set to only accept patterns that follow DD/MM/YYYY HH:MM.
validity_pattern_conditions = (df_DQ_full['Validity_Pattern_|_Datetime_Event1']>0)
df_DQ_full[['Datetime_Event1','Validity_Pattern_|_Datetime_Event1']].loc[(validity_pattern_conditions)]
Datetime_Event1 | Validity_Pattern_|_Datetime_Event1 | |
---|---|---|
100 | 1991/03/2024 00:00:00 | 1 |
101 | 1991/03/2024 00:00:00 | 1 |
102 | 1991/03/2024 00:00:00 | 1 |
103 | 1991/03/2024 00:00:00 | 1 |
104 | 1991/03/2024 00:00:00 | 1 |
105 | 1991/03/2024 00:00:00 | 1 |
106 | 1991/03/2024 00:00:00 | 1 |
107 | 1991/03/2024 00:00:00 | 1 |
108 | 1991/03/2024 00:00:00 | 1 |